Decoding multiple HMM sets using a single sentence grammar

ABSTRACT

For a given sentence grammar, speech recognizers are often required to decode M sets of HMMs, each of which models a specific acoustic environment. In order to match input acoustic observations to each of the environments, recognition search methods typically require a network of M sub-networks. A new speech recognition search method is described here that needs only one of the M sub-networks yet gives the same recognition performance, thus reducing the memory required for network storage by a factor of (M-1)/M.

FIELD OF THE INVENTION

[0001] This invention relates to speech recognition and more particularly to a speech recognition search method.

BACKGROUND OF THE INVENTION

[0002] Speech recognition devices are typically deployed in different acoustic environments. An acoustic environment refers to a stationary condition in which the speech is produced. For instance, a speech signal can be produced by male speakers or female speakers, in an office environment or in a noisy environment.

[0003] A common way of dealing with multiple-environment speech recognition is to train a set of Hidden Markov Models (HMMs) for each environment. For example, there would be a pronunciation set or network of HMMs (grammars) for male speakers and a set of HMMs for female speakers, because the sounds or models for a male speaker differ from those for a female speaker. At the recognition phase, the HMMs of all environments are decoded and the recognition result of the environment giving the maximum likelihood is taken as the final result. Such a practice is very effective in terms of recognition performance. For example, if separate male/female models are not used, with the same amount of HMM parameters, the Word Error Rate (WER) will typically increase by 70%.

[0004] More specifically, for a given sentence grammar, the speech recognizer is required to decode M (the number of environments) sets of HMMs, each of which models a specific acoustic environment. In order to perform acoustic matching with each of the environments, recognition search methods (including state-of-the-art recognizers such as HTK 2.0) typically require a network of M sub-networks, as illustrated in FIG. 1. Requiring M sets of the sentence network makes the recognition device more costly because it requires much more memory.

SUMMARY OF THE INVENTION

[0005] A new speech recognition search method is described here, which needs only one of the M sub-networks (sentence networks) and yet gives the same recognition performance, thus reducing the memory required for network storage by a factor of (M-1)/M. The speech recognition method includes a basic speaker-independent grammar or network (sentence structure) and virtual symbols representing the network of expanded HMM sets, where the pronunciation of each symbol is specified by a set of HMM states. The new recognizer builds recognition paths defined on the expanded symbols, and accesses the network using base-symbols, through a proper conversion function that gives the base-symbol of any expanded symbol, and vice versa.
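
The storage saving follows from simple counting. As a sketch, assume every sub-network occupies the same storage S (the symbol S is introduced here for illustration only and does not appear in the description):

```latex
\[
\frac{M S - S}{M S} \;=\; \frac{M-1}{M},
\qquad\text{e.g. } M = 2 \;\Rightarrow\; \frac{M-1}{M} = \frac{1}{2} = 50\%.
\]
```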

DESCRIPTION OF DRAWING

[0006] In the drawing:

[0007] FIG. 1 illustrates that conventional recognizers require a large network to recognize multiple HMM sets;

[0008] FIG. 2 illustrates a block diagram of the system according to one embodiment of the present invention; and

[0009] FIG. 3 illustrates the main program loop.

DESCRIPTION OF PREFERRED EMBODIMENT

[0010] In the present application we refer to a node in the network describing the sentence grammar as a symbol. For typical recognizers, a symbol has to be duplicated M times in the network when M sets of HMMs are used. This is illustrated in FIG. 1, where three sets of sentence networks are depicted.

[0011] In accordance with the present invention, a network is constructed to represent a merged version of the M networks that is speaker independent. For the male and female case this would be a merged version of the male and female networks and would be gender-independent. The models for children may also be merged. Other environments may also be merged. We then need to decode the specific HMMs, such as those for male, female and child speakers, and combine them with the generic (speaker-independent) network, in which the male, female and child cases share the same nodes and transitions.

[0012] In applicant's method of decoding M HMM sets, two types of symbols are distinguished:

[0013] Base-symbols (α): Symbols representing the basic grammar or network (i.e., the network before duplication for M sets of HMMs). They have physical memory space for storage. This network is generic (speaker independent) and represents the nodes and transitions.

[0014] Expanded-symbols (α̃): Symbols representing the network of M-1 expanded HMM sets. Their existence in the grammar network is conceptual. These symbols may represent, for example, the two sets of HMMs for male and female.

[0015] For each base-symbol in the network, there are M-1 corresponding expanded-symbols. The new recognizer builds recognition paths defined on the expanded-symbols, and accesses the network using base-symbols, through a proper conversion function that gives the base-symbol of any expanded symbol.
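
The conversion between base and expanded symbols can be sketched in Python as plain index arithmetic. The numbering scheme is an assumption made for illustration: expanded symbols are numbered set by set (set k of base symbol b gets index k * n_base + b, so set 0 coincides with the base index), and each HMM set occupies a contiguous block of n_hmm_per_set entries. The names `Offsets`, `to_expanded` and the argument names are hypothetical; the description only requires that some such conversion function exists.

```python
from typing import NamedTuple


class Offsets(NamedTuple):
    hmm_offset: int  # shift into the HMM table for this HMM set
    sym_offset: int  # shift from a base index to its expanded index
    base: int        # base-symbol index stored in the network


def to_expanded(base: int, set_index: int, n_base: int) -> int:
    """Expanded-symbol index of a base symbol in a given HMM set (assumed layout)."""
    return set_index * n_base + base


def get_offsets(expanded: int, n_base: int, n_hmm_per_set: int) -> Offsets:
    """Recover the HMM offset, symbol offset and base symbol of an expanded symbol."""
    set_index, base = divmod(expanded, n_base)
    return Offsets(set_index * n_hmm_per_set, set_index * n_base, base)


# Example: 13 base symbols, two HMM sets (say male = 0, female = 1), 12 HMMs per set.
N_BASE, N_HMM = 13, 12
female_sym = to_expanded(base=4, set_index=1, n_base=N_BASE)
off = get_offsets(female_sym, N_BASE, N_HMM)
```

Only the base symbols need stored nodes and transitions under this layout; an expanded symbol is just an integer from which the base symbol and the two offsets are recomputed on demand.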

[0016] Referring to FIG. 2, there is illustrated the system according to one embodiment of the present invention. For the combined male and female case, the generic network represented by the base symbols α is stored in memory 21. This provides the network structure itself. Also stored, in memory 23, are a set of HMMs for male and a set for female, for example. A set of HMMs may also be provided for children. The base symbols contain the sentence structure. The process is to identify the HMM to be used. For every incoming speech frame, a main loop program performs recognition path construction and update-observation-probability. The main loop program (see FIG. 3) includes a path-propagation program 25 and an update-observation-probability program 27.

[0017] The MAIN-LOOP program illustrated in FIG. 3 performs recognition path construction for every incoming speech frame:

MAIN-LOOP (network, models):
Begin
    For t = 1 to N Do
    Begin
        PATH-PROPAGATION (network, models, t);
        UPDATE-OBSERVATION-PROB (network, models, t);
    End
End

[0018] A path consists of a sequence of symbols, and the pronunciation of each symbol is specified by a set of hidden Markov model states. Consequently, a path can be either a within-model path or a cross-model path, both of which the decoding procedure constructs for each symbol:

PATH-PROPAGATION (network, hmms, t):
Begin
    For each active α̃ at frame t − 1 Do
    Begin
        (Δ_(hmm), Δ_(sym), α) = get-offsets (α̃, network);
        hmm = hmms[hmm-code(symbol-list(network)[α]) + Δ_(hmm)];
        WITHIN-MODEL-PATH (hmm, p_(t−1)^(α̃), p_(t)^(α̃));
        CROSS-MODEL-PATH (hmms, network, α̃, α, Δ_(hmm), Δ_(sym), t, score (p_(t−1)^(α̃), EXIT-STATE));
    End
End

[0019] where:

[0020] p_(t)^(s) denotes the storage of path information for the expanded-symbol s at frame t.

[0021] “get-offsets” gives the offset of the HMM (Δ_(hmm)), the offset of the symbol (Δ_(sym)) and the base-symbol (α), given α̃ and a network.

[0022] “symbol-list” returns the list of symbols of a network.

[0023] “hmm-code” gives the index of the HMM associated with a symbol.

[0024] Score (p, i) gives the score at state i of the symbol storage p. For each state we keep the symbol and the frame from which the path arrived (going from t−1 to t), so that the sequence of words can be traced back. The nodes are constructed based on the model.
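
The HMM lookup used in PATH-PROPAGATION, hmms[hmm-code(symbol-list(network)[α]) + Δ_(hmm)], can be illustrated with a small Python sketch. The three-symbol network, the model names and the back-to-back storage of the two HMM sets are assumptions chosen for illustration only.

```python
# Base symbols of the single stored network (hypothetical three-symbol example).
symbol_list = ["sil", "zero", "one"]

# hmm-code: index of the HMM for a symbol inside one HMM set.
hmm_code = {"sil": 0, "zero": 1, "one": 2}

# Two HMM sets stored back to back: male models first, then female models.
hmms = ["sil_m", "zero_m", "one_m", "sil_f", "zero_f", "one_f"]


def lookup_hmm(base: int, hmm_offset: int) -> str:
    """Select the HMM for a base symbol, shifted into the wanted HMM set."""
    return hmms[hmm_code[symbol_list[base]] + hmm_offset]


male_zero = lookup_hmm(base=1, hmm_offset=0)    # offset 0 selects the male set
female_zero = lookup_hmm(base=1, hmm_offset=3)  # offset 3 selects the female set
```

The same base symbol reaches either model purely through the offset, which is why the network itself never has to be duplicated.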

[0025] In the search algorithm, for each frame time t from 1 to N, the search looks back at time t−1 and determines the base symbol (see FIG. 2). The generic network 21 is then accessed, given the expanded symbol α̃, to get the offset of the HMM (Δ_(hmm)). Once Δ_(hmm) is determined, the HMM memory 23 can be accessed such that the HMM that corresponds to the male or female set is provided. Once the HMM is obtained, the sequence of states for the within-model path is determined, and then the cross-model path. The sequence of HMM states is constructed in the recognition path construction 25, both within an HMM and between models. There are therefore two key functions for decoding, within-model-path construction and cross-model-path construction:

WITHIN-MODEL-PATH (hmm, p_(t−1), p_(t));
Begin
    For each HMM state i of hmm Do
    Begin
        For each HMM state j of hmm Do
        Begin
            score (p_(t), j) = score (p_(t−1), i) + a_(ij);
            from-frame (p_(t), j) = from-frame (p_(t−1), i);
            from-symbol (p_(t), j) = from-symbol (p_(t−1), i);
        End
    End
End

[0026] where:

[0027] a_(ij) is the transition probability from state i to state j.
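
A minimal Python sketch of the within-model update follows. It makes two assumptions that the pseudocode leaves implicit: scores are log probabilities, and a destination state keeps only its best-scoring predecessor (a standard Viterbi update). The `PathStore` class and `new_store` helper are hypothetical names introduced for this example.

```python
import math
from dataclasses import dataclass

NEG_INF = -math.inf


@dataclass
class PathStore:
    score: list        # best log score per HMM state
    from_frame: list   # frame at which the path entered this symbol
    from_symbol: list  # expanded symbol the path came from


def new_store(n_states: int) -> PathStore:
    return PathStore([NEG_INF] * n_states, [None] * n_states, [None] * n_states)


def within_model_path(log_a, prev: PathStore, cur: PathStore) -> None:
    """Propagate paths from frame t-1 (prev) to frame t (cur) inside one HMM."""
    n = len(log_a)
    for i in range(n):
        if prev.score[i] == NEG_INF:
            continue  # state i was not active at frame t-1
        for j in range(n):
            cand = prev.score[i] + log_a[i][j]
            if cand > cur.score[j]:  # keep only the best predecessor (assumption)
                cur.score[j] = cand
                cur.from_frame[j] = prev.from_frame[i]
                cur.from_symbol[j] = prev.from_symbol[i]


# Two-state left-to-right HMM; state 0 is active and was entered at frame 3 from symbol 7.
log_a = [[math.log(0.6), math.log(0.4)], [NEG_INF, 0.0]]
prev = new_store(2)
prev.score[0], prev.from_frame[0], prev.from_symbol[0] = 0.0, 3, 7
cur = new_store(2)
within_model_path(log_a, prev, cur)
```

The back-pointers are copied unchanged, since a within-model step does not enter a new symbol.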

[0028] When we construct the within-HMM path, we need the path storage for both t and t−1. The sentence with the highest score is determined based on the highest transition log probability. This is done for every state in the HMM (for each state j in the pseudocode). Once we arrive at the end, we go back and find the sequence of symbols that has been recognized. This is stored.

CROSS-MODEL-PATH (hmms, network, α̃, α, Δ_(hmm), Δ_(sym), t, *i);
Begin
    For each next symbol s of α Do
    Begin
        hmm = hmms[hmm-code(symbol-list(network)[s]) + Δ_(hmm)];
        For each HMM initial state j of hmm Do
        Begin
            score (p_(t)^(s̃), j) = *i × π (j);
            from-frame (p_(t)^(s̃), j) = t − 1;
            from-symbol (p_(t)^(s̃), j) = α̃;
        End
    End
End

[0029] For the cross-model path, we need to consider all possible next symbols s of α. These are the true symbols s (the knowledge of the grammar tells which symbol follows which symbol). For each one we determine its initial state, or first HMM state, perform the step between models, and add the transition probability (log probability) from one state to another. We use the symbol π for the probability of entering a state from outside the model. We also record the symbol and frame from which the path came, so that at the end we can go back and check the sequence of words. By doing this within and between models, we have constructed all the nodes.
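
The cross-model step can be sketched in Python as follows. Several points are assumptions rather than statements of the description: scores are combined by addition in the log domain, the expanded index of a next symbol is taken to be s + Δ_(sym), only the best entry per state is kept, and the HMM lookup is simplified to a table `log_pi` of initial-state log probabilities keyed by expanded symbol. All names are hypothetical.

```python
import math


def cross_model_path(next_symbols, log_pi, sym_offset, exit_score, expanded, t, store):
    """Enter the initial states of every grammar successor of the current symbol.

    store maps an expanded-symbol index to {state: (score, from_frame, from_symbol)}.
    """
    for s in next_symbols:            # successors are read from the base network
        target = s + sym_offset       # stay inside the same HMM set (assumption)
        entries = store.setdefault(target, {})
        for j, log_p in enumerate(log_pi[target]):
            if log_p == -math.inf:
                continue              # j is not an initial state of this HMM
            cand = exit_score + log_p
            if j not in entries or cand > entries[j][0]:
                entries[j] = (cand, t - 1, expanded)


# Base symbol 0 has successors 1 and 2; we are in set 1 of a 3-symbol network (offset 3).
store = {}
log_pi = {4: [0.0, -math.inf], 5: [math.log(0.5), math.log(0.5)]}
cross_model_path([1, 2], log_pi, sym_offset=3, exit_score=-2.0,
                 expanded=3, t=5, store=store)
```

The successors are looked up once, in the base network, and only the index of the destination storage depends on the HMM set.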

[0030] Finally, once a path is expanded according to the grammar network, its acoustic score is evaluated:

[0031] UPDATE-OBSERVATION-PROB (network, models, t);
Begin

[0032] For each active α̃ at frame t Do
    Begin
        (Δ_(hmm), α) = get-true-symbol (α̃, network);
        hmm = hmms[hmm-code(symbol-list(network)[α]) + Δ_(hmm)];
        For each HMM state j of hmm Do
        Begin
            Evaluate score (p_(t)^(α̃), j);
        End
        calculate score for α̃;
    End
End

[0033] where:

[0034] “get-true-symbol” returns the base-symbol of an expanded symbol.

[0035] These steps are all based on the model. The next step is to look at the speech and validate the paths by comparison with the actual speech. This is done in the update-observation-probability program 27 (see FIG. 2). We need to find the HMM and, for every HMM state, evaluate the score against the storage area at that time for the symbol α. The highest score is used. The best-scoring models are provided.
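
As a rough Python sketch of the observation update, assume each HMM state emits from a single diagonal-covariance Gaussian and that the score of a symbol is the best score over its states. The description does not specify the emission model, so both choices, and the function names, are assumptions made for illustration.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at observation x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def update_observation_prob(scores, state_params, frame):
    """Add the acoustic log likelihood of `frame` to every active state score.

    Returns the best state score, used here as the score of the symbol.
    """
    for j, (mean, var) in enumerate(state_params):
        if scores[j] != -math.inf:  # skip inactive states
            scores[j] += log_gauss_diag(frame, mean, var)
    return max(scores)


# One-dimensional example: state 0 is active, state 1 is not.
scores = [0.0, -math.inf]
state_params = [([0.0], [1.0]), ([1.0], [1.0])]
best = update_observation_prob(scores, state_params, frame=[0.0])
```

Because the states are reached through the HMM selected by the offset, the same routine serves every HMM set without any change to the network.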

RESULTS

[0036] This new method has been very effective at reducing the memory size. The generic grammar for 1-7 digit strings is as follows:

[0037] $digit=(zero|oh|one|two|three|four|five|six|seven|eight|nine)[sil];

[0038] $DIG=$digit[$digit[$digit[$digit[$digit[$digit[$digit]]]]]];

[0039] $SENT=[sil]$DIG[sil];

[0040] The grammar says that a digit is zero or oh or one or two, etc., optionally followed by silence. It also says that a digit string ($DIG) is composed of a single digit, two digits, etc., up to seven digits. It also says that a sentence is a digit string optionally preceded and followed by silence.

[0041] The grammar for the 1-7 digit strings for the old gender-dependent way follows:
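
The generic grammar can be read as a small acceptor over word tokens. The Python sketch below is only an illustration of what $SENT accepts; the function name `accepts` and the token representation are assumptions, and a recognizer would compile the grammar into a network rather than parse token lists.

```python
DIGITS = {"zero", "oh", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"}


def accepts(tokens) -> bool:
    """Return True if the token list matches $SENT = [sil] $DIG [sil]."""
    i = 0
    if i < len(tokens) and tokens[i] == "sil":      # optional leading [sil]
        i += 1
    count = 0
    while i < len(tokens) and tokens[i] in DIGITS and count < 7:
        i += 1
        count += 1
        if i < len(tokens) and tokens[i] == "sil":  # optional [sil] inside $digit
            i += 1
    if i < len(tokens) and tokens[i] == "sil":      # optional trailing [sil]
        i += 1
    return count >= 1 and i == len(tokens)


ok = accepts("sil one two sil".split())
too_long = accepts(["one"] * 8)
```

The gender-dependent grammar accepts the same word sequences; it differs only in which HMM set pronounces them, which is why one network suffices.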

[0042] $digit_m=(zero_m|oh_m|one_m|two_m|three_m|four_m|five_m|six_m|seven_m|eight_m|nine_m)[sil_m];

[0043] $DIG_m=$digit_m[$digit_m[$digit_m[$digit_m[$digit_m[$digit_m[$digit_m]]]]]];

[0044] $SENT_m=[sil_m]$DIG_m[sil_m];

[0045] $digit_f=(zero_f|oh_f|one_f|two_f|three_f|four_f|five_f|six_f|seven_f|eight_f|nine_f)[sil_f];

[0046] $DIG_f=$digit_f[$digit_f[$digit_f[$digit_f[$digit_f[$digit_f[$digit_f]]]]]];

[0047] $SENT_f=[sil_f]$DIG_f[sil_f];

[0048] $S=$SENT_m|$SENT_f;

[0049] This is twice the size of the generic grammar.

[0050] The purpose of the tests is to calibrate the resource requirements and to verify that the recognition scores are bit-exact with those of the multiple-grammar decoder. Tests are based on ten files, five male and five female.

[0051] The two grammars above give, respectively, a single-network sentence grammar and a multiple-network sentence grammar (two networks, one for male and one for female).

COMPUTATION REQUIREMENT

[0052] Due to the conversion between base and expanded symbols, the search method is certainly more complex than the one requiring M sets of networks. To determine how much more computation is needed for the sentence network memory saving, the CPU cycles of the top 20 functions are counted and are shown in Table 1 (excluding three file I/O functions).

TABLE 1
CPU cycle comparison for top time-consuming functions (UltraSPARC-II).

Item                        multiple-grammar    single-grammar
allocate_back_cell                   1603752           1603752
coloring_beam                        1225323           1225323
coloring_pending_states*             2263190           2560475
compact_beam_cells                   2390449           2390449
cross_model_path*                    2669081           2847944
fetch_back_cell                     10396389          10396389
find_beam_index                      7880086           7880086
get_back_cell                         735328            735328
init_beam_list                        700060            700060
logGaussPdf_decode                  19930695          19930695
log_gauss_mixture                    2695988           2695988
mark_cells                          13794636          13794636
next_cell                             898603            898603
path_propagation*                    1470878           1949576
Setconst                              822822            822822
update_obs_prob*                     5231532           5513276
within_model_path                    3406688           3406688

[0053] It can be seen that the cycle consumption for most functions stays the same. Only four functions showed slight changes. Table 2 summarizes cycle consumption and memory usage. The 1.58% increase is spent on calculating the set-index, and can be further reduced by storing the indices. However, the percent increase is so low that at this time it might not be worth investigating the alternative, a CPU-efficient implementation.

TABLE 2
Comparison of multiple-grammar vs. single-grammar (memory-efficient implementation).

Item            multiple-grammar    single-grammar    increase (%)
TOP CYCLES              78115500          79352090            1.58
NETWORK SIZE               11728              5853           −50.0
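
The totals in Table 2 can be cross-checked against the per-function counts in Table 1. The short Python sketch below simply sums the two columns and computes the relative increase; the variable names are illustrative.

```python
# (function, multiple-grammar cycles, single-grammar cycles) as listed in Table 1
TABLE_1 = [
    ("allocate_back_cell", 1603752, 1603752),
    ("coloring_beam", 1225323, 1225323),
    ("coloring_pending_states", 2263190, 2560475),
    ("compact_beam_cells", 2390449, 2390449),
    ("cross_model_path", 2669081, 2847944),
    ("fetch_back_cell", 10396389, 10396389),
    ("find_beam_index", 7880086, 7880086),
    ("get_back_cell", 735328, 735328),
    ("init_beam_list", 700060, 700060),
    ("logGaussPdf_decode", 19930695, 19930695),
    ("log_gauss_mixture", 2695988, 2695988),
    ("mark_cells", 13794636, 13794636),
    ("next_cell", 898603, 898603),
    ("path_propagation", 1470878, 1949576),
    ("Setconst", 822822, 822822),
    ("update_obs_prob", 5231532, 5513276),
    ("within_model_path", 3406688, 3406688),
]

multi_total = sum(row[1] for row in TABLE_1)
single_total = sum(row[2] for row in TABLE_1)
increase_pct = 100.0 * (single_total - multi_total) / multi_total

# Functions whose cycle count differs between the two decoders
changed = [name for name, multi, single in TABLE_1 if multi != single]
```

The sums reproduce the TOP CYCLES row, and the four functions that differ are the ones marked with an asterisk in Table 1.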

1. A method of speech recognition comprising: decoding multiple HMM sets using one set of sentence network and recognizing speech using said decoded multiple HMM sets.
2. A speech recognizer comprising: means for decoding HMM sets using one set of sentence network and a recognizer recognizing speech using said decoded multiple HMM sets.
3. The method of claim 1 wherein the means for decoding includes means for building recognition paths defined on expanded symbols and accessing said network using base symbols through a conversion function.
4. The method of claim 3 wherein said decoding includes within model construction and between model construction. The method of claim 4 wherein said decoding includes update-observation-probability.
5. A speech recognition search method comprising: providing a set of generic grammars, providing symbols representing a network expanded sets and building recognition paths defined by the symbols and accessing the network using base symbols through proper conversion function that gives the true symbol of any expanded symbol.
6. A method of speech recognition comprising the steps of: providing a generic network containing base symbols; a single set of HMMs for male and female; building recognition paths defined on virtual symbols corresponding to base symbols; accessing said generic network using said base symbols through conversion function that gives base symbols for virtual symbols to therefore decode multiple HMM sets using a single sentence grammar and using said HMM sets to recognize incoming speech.
7. The method of claim 6 wherein said building step includes for each frame path propagation expansion based on the grammar network and update-observation-probability.
8. The method of claim 7 wherein said path propagation includes getting offset HMMs offset symbols and the base symbol for a given expanded symbol and obtaining the HMM of the previous frame and expanding and storing a sequence set of HMM states both for within model path and cross model path and determining the path with the best transition probability.
9. The method of claim 8 wherein said update-observation-probability includes getting the base symbol of an expanded symbol and validating state by state the base symbol by comparing to speech in the present frame for the base symbol associated with the virtual symbol.